1. INTRODUCTION

The XGBoost algorithm, introduced by Chen and Guestrin [1], has become an effective and widely used
approach in machine learning. Many studies have successfully applied the XGBoost model to time series
forecasting problems, such as forecasting stock prices [2], energy [3], hemorrhagic fever [4], oil prices [5],
and traffic flow [6]. In line with this trend, power load forecasting has also been investigated by many
scholars using the XGBoost model, with impressive results [7]-[12]. One feature of the XGBoost model is
that its accuracy depends on hyperparameters, including the number of gradient boosted trees, maximum
tree depth, boosting learning rate, minimum sum of instance weight, subsample ratio of columns, and so on.
Therefore, determining the optimal hyperparameters is essential for applying the XGBoost model [13]-[15].
Several algorithms have been used to determine these optimal hyperparameters, among which the grid
search (GS) algorithm combined with the cross-validation technique is often preferred for its efficiency and
simplicity. The GS algorithm searches over all hyperparameter sets in a grid space while recording an error
metric, the criterion for evaluating model performance. It then returns the model whose hyperparameters
give the smallest error metric in the training process [16]-[19]. However, the minimum value of a dataset
normally fluctuates, so the optimal model that the GS algorithm determines during the training process may
not be the best one in the testing process.
In this regard, the present work proposes a GS algorithm based on median values instead of minimum
values. The boxplot, which is embedded in the proposed GS algorithm, allows the statistical characteristics
of the data to be analyzed according to five distribution positions, namely the minimum value (min), first
quartile (Q1), median (Q2), third quartile (Q3), and maximum value (max) [17], [20], [21]. A benchmark
method is set up to compare the accuracy of the proposed algorithm with that of the traditional one. Daily
load data of Ho Chi Minh City (HCM), Vietnam, and the state of Tasmania (TA), Australia, were employed
in the experiments to verify the accuracy of this study.

The rest of the paper is organized as follows. Section 2 gives an overview of the XGBoost model and its
hyperparameters. Section 3 describes the original GS algorithm, the proposed GS algorithm, and the method
for evaluating both algorithms. Section 4 presents and discusses the experimental results. Conclusions are
given in section 5.


2. METHOD 
2.1. The XGBoost model 

XGBoost, a boosting algorithm, is a powerful method for regression as well as classification
[22], [23]. Suppose the dataset is $D = \{(x_i, y_i)\}$ ($x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$,
$i = 1, \dots, n$), where $m$ is the dimension of each sample and $n$ is the number of samples. A tree
ensemble model of $K$ decision trees gives the predicted value:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$

$$F = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \tag{2}$$

where $F$ represents the space of regression trees, $q$ indicates the structure of each tree, $T$ is the number of
leaves in the tree, $w$ is the leaf weight, and $f_k$ corresponds to an independent tree.
The goal of the model is to learn the functions in (1) by minimizing the objective function defined in (3):

$$L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) \tag{3}$$

where $\hat{y}_i$ and $y_i$ are the predicted and real values; $l$ is the training loss measuring the difference
between $\hat{y}_i$ and $y_i$; and $\Omega$ is the complexity of the model, which is used to prevent overfitting
and is given by:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 \tag{4}$$

In (4), $\gamma$ is the penalty coefficient which controls the complexity of the model and $\lambda$ is the
penalty coefficient of the leaf weight. Let $\hat{y}_i^{(t)}$ be the prediction of the i-th instance at the t-th
iteration; it can be obtained from (5):

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{5}$$
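Substituting (5) into the objective (3), the objective function can be rewritten as (6):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{6}$$

To improve the convergence speed and accuracy, the second-order Taylor approximation is used,
and (6) is transformed into (7):

$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

$$g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right) \tag{7}$$

where $g_i$ and $h_i$ are the first and second order gradient statistics of the loss function. After removing the
constant term, the objective function at step t becomes the optimization goal for the new tree, as in (8):

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{8}$$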


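Defining $I_j = \{ i \mid q(x_i) = j \}$ as the instance set of leaf j, replacing $f_t(x_i)$ by the tree definition
$w_{q(x_i)}$, and expanding $\Omega$, (8) can be changed into (9):

$$\begin{aligned}
\tilde{L}^{(t)} &= \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T
\end{aligned} \tag{9}$$

The optimal weight $w_j^*$ of leaf j can be obtained from (10):

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \tag{10}$$

The corresponding optimal value for the tree is then given by (11):

$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \tag{11}$$

Equation (11) is used as a scoring function to measure the quality of a tree structure q.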
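
As a small numerical illustration of (10) and (11), the sketch below evaluates the optimal leaf weight and the structure score for a single-leaf tree. It assumes a squared-error loss l = 0.5(y - ŷ)², for which g = ŷ - y and h = 1; the loss choice, the function names, and the data are illustrative assumptions and are not taken from the paper.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf weight from (10): w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves, lam, gamma):
    """Score from (11) for a tree given as a list of (g, h) pairs, one per leaf."""
    total = 0.0
    for g, h in leaves:
        total += sum(g) ** 2 / (sum(h) + lam)
    return -0.5 * total + gamma * len(leaves)


# Assumed loss: l = 0.5 * (y - y_hat)^2, so g = y_hat - y and h = 1.
y = [3.0, 5.0, 4.0]          # hypothetical targets
y_hat = [0.0, 0.0, 0.0]      # prediction before the new tree is added
g = [p - t for p, t in zip(y_hat, y)]
h = [1.0] * len(y)

w_star = leaf_weight(g, h, lam=1.0)                      # 12 / (3 + 1)
score = structure_score([(g, h)], lam=1.0, gamma=0.0)    # -0.5 * 144 / 4
```

With these numbers the single leaf receives the weight 3.0, slightly below the mean target of 4.0 because of the penalty term with λ = 1.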


2.2. The hyperparameters of XGBoost
XGBoost is a powerful algorithm, so it involves many design decisions and hence a large number of
hyperparameters [24]-[26]. Hyperparameters are values or weights that define the learning process of
an algorithm. The hyperparameters of the XGBoost model can be classified into three categories:
- General parameters: define the overall functionality of the XGBoost model, such as booster, verbosity, and nthread.
- Booster parameters: control the performance of the XGBoost model, such as learning_rate (LR),
min_child_weight (MC), max_depth (MD), subsample, lambda, and alpha.
- Learning task parameters: define the optimization objective to be calculated at each step, such as
objective, eval_metric, and seed.
There are thus many tunable hyperparameters for tree-based learners in the XGBoost model; the most
common ones are described in Table 1.


Table 1. The common hyperparameters of XGBoost
Hyperparameters     Definition
booster             Select the type of model to run at each iteration
nthread             The number of parallel threads used to run XGBoost; used for parallel
                    processing, and the number of cores in the system should be entered
learning_rate       Shrinking the step size is used to prevent overfitting. Range is [0,1]
min_child_weight    Determines the minimum weighted sum of all required observations in a child
max_depth           Determines how deep each tree is allowed to grow in any boost loop
subsample           Percentage of samples used per tree. Low values can lead to underfitting
objective           Defines the loss function to be minimized
eval_metric         The metric to be used for validation data


3. GRID SEARCH AND PROPOSED GRID SEARCH ALGORITHM 
3.1. The original grid search algorithm 

The accuracy of machine learning models in general, and of XGBoost models in particular, depends
on their hyperparameters. Many algorithms are used to determine the optimal hyperparameters,
typically GS, random search (RS), genetic algorithm (GA), particle swarm optimization (PSO), and Bayesian
optimization (BO) [13], [14], [25]. Of these, the GS algorithm is explored in this study. The principle of
the original GS algorithm is to generate a grid of possible values for the hyperparameters. During iteration,
the algorithm combines the hyperparameters in a specific order, fits the model, and records the performance
(error metric) of the model. Finally, the algorithm returns the hyperparameters with the best
performance. In this paper, the procedure of the original GS algorithm is reviewed in terms of the boxplot,
whose form is illustrated in Figure 1. The boxplot is a method for describing the
distribution of data in statistics [20], [21]. It determines the lower quartile (Q1), median (Q2), and upper
quartile (Q3) values. The interquartile range is IQR=Q3-Q1, the maximum whisker length is
1.5×IQR, and outliers are points which lie beyond the whiskers.


[Figure 1: horizontal boxplot marking the outliers, the lower and upper whiskers (each at most 1.5 IQR
long), and Q1, Q2, and Q3, with the IQR spanning Q1 to Q3]


Figure 1. The boxplot definition
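
The boxplot quantities defined above can be computed with the Python standard library, as in the minimal sketch below. The quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption, since the paper does not state which convention its boxplots use, and the sample values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return Q1, Q2, Q3, the IQR, and the points lying beyond the whiskers."""
    # Assumed convention: inclusive quartiles (other conventions give
    # slightly different Q1 and Q3 on small samples).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # maximum whisker reach
    outliers = [v for v in values if v < low or v > high]
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr, "outliers": outliers}


# Hypothetical error-loss values; 30 lies far above the rest.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Here the median (5.5) is barely affected by the extreme value 30, which is reported as an outlier.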
XGBoost has many hyperparameters, as shown in Table 1; this paper focuses on three of them, namely
LR, MD, and MC. The procedure of the GS algorithm with these three hyperparameters, based on the
boxplot, is shown in Figure 2. First, the ranges of the hyperparameters to tune and their search space are
determined. Next, the optimal value of the LR hyperparameter (LRopt) is found as the one achieving the
minimum error loss in the boxplot of error loss drawn by LR. With LR fixed at LRopt, the new search space
consists only of the combinations of the MD and MC hyperparameters, and the optimal value of the MD
hyperparameter (MDopt) is obtained by the same process as for LR. Finally, with MD also fixed at MDopt,
the search space consists of just the range of the last hyperparameter MC, from which the optimal value
MCopt is obtained. The output of this procedure is the optimal hyperparameter set {LRopt, MDopt,
MCopt}.


[Figure 2: flowchart.
Grid space: LR = {r1, ..., rLR}, MD = {d1, ..., dMD}, MC = {c1, ..., cMC}.
Step 1: draw boxplot of error loss by LR; obtain the optimal value LRopt (the lowest value of error loss).
Step 2: fixed LR=LRopt; draw boxplot of error loss by MD; obtain the optimal value MDopt (the lowest value of error loss).
Step 3: fixed LR=LRopt and MD=MDopt; draw boxplot of error loss by MC; obtain the optimal value MCopt (the lowest value of error loss).
Output: LR = LRopt, MD = MDopt, MC = MCopt.]


Figure 2. The procedure of the original grid search for XGBoost
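
The sequential procedure of Figure 2 can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the error loss of every grid point is already available in a dictionary, and the function `toy_loss` and its numbers are hypothetical.

```python
from itertools import product


def original_gs(errors, names=("LR", "MD", "MC")):
    """Fix each hyperparameter in turn at the value whose group of error
    losses contains the lowest minimum (the bottom of its boxplot)."""
    remaining = dict(errors)              # {(lr, md, mc): error loss}
    chosen = {}
    for pos, name in enumerate(names):
        # Group the remaining error losses by the value at this position.
        groups = {}
        for combo, err in remaining.items():
            groups.setdefault(combo[pos], []).append(err)
        best = min(groups, key=lambda v: min(groups[v]))
        chosen[name] = best
        # Keep only the grid points that use the chosen value.
        remaining = {c: e for c, e in remaining.items() if c[pos] == best}
    (error,) = remaining.values()         # exactly one grid point is left
    return chosen, error


def toy_loss(lr, md, mc):
    """Hypothetical error surface with one isolated low value at (0.1, 1, 1)."""
    if (lr, md, mc) == (0.1, 1, 1):
        return 1
    return 5 + 2 * (lr == 0.1) + 2 * (md == 1) + 2 * (mc == 1)


grid = product([0.1, 0.2, 0.3, 0.4], range(1, 5), range(1, 5))
toy = {combo: toy_loss(*combo) for combo in grid}
best_params, best_error = original_gs(toy)
```

Because every step follows the lowest minimum, the procedure ends at the global minimum of the grid, here the isolated low value at (0.1, 1, 1).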


To enhance its effectiveness, the GS algorithm is commonly combined with the cross-validation technique
when searching for optimal parameters. The technique is performed using k-fold cross-validation: the data
is divided into k equal subsets, and in each iteration one subset is used for testing while the remaining ones
are used for training. Once the model has been trained k times, the overall training performance is evaluated
as the average of the results obtained in each iteration.
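
A minimal sketch of k-fold averaging is given below. The contiguous fold layout, the helper names, and the mean-predictor "model" are assumptions made only to keep the example self-contained; they are not the setup used in the paper.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of nearly equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_average(fit_and_score, n_samples, k):
    """Average the score returned by fit_and_score(train_idx, test_idx)."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k


y = [1.0, 2.0, 3.0, 4.0]                  # hypothetical targets


def mean_model_mse(train_idx, test_idx):
    """Stand-in model: predict the training mean, score with MSE."""
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return sum((y[i] - mean) ** 2 for i in test_idx) / len(test_idx)


score = cv_average(mean_model_mse, n_samples=len(y), k=2)
```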


3.2. The proposed grid search algorithm 

In this paper, the proposed algorithm follows the procedure of the original GS algorithm described
above; the difference is that each optimal hyperparameter is determined from the minimum of the median
values of the error loss, instead of the minimum values used by the original GS algorithm. The
procedure of the proposed GS algorithm using the boxplot is shown in Figure 3. In Figure 3(a), model 1
presents the procedure based on the median values, in which the optimal parameters are determined
sequentially in the order LR, MD, and MC. This procedure yields the optimal hyperparameters {LRopt-1,
MDopt-1, MCopt-1}. Since the selection criterion is based on median values, changing the order in which
the hyperparameters are determined may lead to different optimal values. This study considers only three
hyperparameters of the XGBoost model (LR, MD, and MC), which allows five more orderings to be
established, corresponding to models 2 to 6. These models follow the same procedure as model 1, giving a
total of six optimal sets of hyperparameters. In the last step, shown in Figure 3(b), the error metric values of
the six optimal sets are compared, and the set of hyperparameters with the smallest error metric is the
output of the proposed GS algorithm.


[Figure 3(a): flowchart for model 1.
Grid space: LR = {r1, ..., rLR}, MD = {d1, ..., dMD}, MC = {c1, ..., cMC}.
Draw boxplot of error loss by LR and obtain the optimal value LRopt-1; draw boxplot of error loss by MD
and obtain the optimal value MDopt-1; draw boxplot of error loss by MC and obtain the optimal value MCopt-1.
Output: LR = LRopt-1, MD = MDopt-1, MC = MCopt-1.]

[Figure 3(b): the six models and the order in which each determines its hyperparameters.
Model 1: LR, MD, MC
Model 2: LR, MC, MD
Model 3: MD, MC, LR
Model 4: MD, LR, MC
Model 5: MC, MD, LR
Model 6: MC, LR, MD
Obtain the optimal hyperparameter (the lowest value of error loss): LR = LRopt, MD = MDopt, MC = MCopt.]


Figure 3. The procedure of the proposed grid search for XGBoost (a) the procedure of model 1 based on
the median values and (b) the combination of 6 models
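
The proposed procedure can be sketched in the same style: each step picks the value with the lowest median error loss, all six orderings are tried, and the best of the six candidates is returned. This is an illustrative sketch under stated assumptions, not the authors' code: error losses are assumed to be precomputed, ties are broken by first occurrence, `toy_loss` is a hypothetical error surface, and the orderings come out in the order produced by `itertools.permutations` rather than the model numbering of Figure 3(b).

```python
from itertools import permutations, product
from statistics import median


def median_gs(errors, order):
    """One model of Figure 3(a): fix the hyperparameters in the given order of
    positions (0 = LR, 1 = MD, 2 = MC), each at the value with the lowest median."""
    remaining = dict(errors)              # {(lr, md, mc): error loss}
    chosen = [None, None, None]
    for pos in order:
        groups = {}
        for combo, err in remaining.items():
            groups.setdefault(combo[pos], []).append(err)
        best = min(groups, key=lambda v: median(groups[v]))
        chosen[pos] = best
        remaining = {c: e for c, e in remaining.items() if c[pos] == best}
    combo = tuple(chosen)
    return combo, errors[combo]


def proposed_gs(errors):
    """Figure 3(b): run all six orderings and keep the lowest error loss."""
    candidates = [median_gs(errors, order) for order in permutations(range(3))]
    return min(candidates, key=lambda item: item[1])


def toy_loss(lr, md, mc):
    """Hypothetical error surface with one isolated low value at (0.1, 1, 1)."""
    if (lr, md, mc) == (0.1, 1, 1):
        return 1
    return 5 + 2 * (lr == 0.1) + 2 * (md == 1) + 2 * (mc == 1)


grid = product([0.1, 0.2, 0.3, 0.4], range(1, 5), range(1, 5))
toy = {combo: toy_loss(*combo) for combo in grid}
best_combo, best_error = proposed_gs(toy)
```

On this toy surface the median criterion steers every ordering away from the isolated minimum at (0.1, 1, 1), whose neighbours all have high error loss, and settles in the flat low region instead. The returned training error (5) is therefore larger than the global minimum (1), which mirrors the trade-off the proposed algorithm makes.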


3.3. Method for evaluating 

Figure 4 shows the method used to benchmark the proposed GS algorithm. This method consists
of three processes: data extracting, training, and testing. The data extracting process: the data is divided into
training and testing datasets. The training dataset (Xtrain, Ytrain) is used to train the model, while the testing
dataset (Xtest, Ytest) is used to evaluate the proposed and original algorithms.

The training process: the search space is defined by the combinations of the tuning values of
the hyperparameters LR, MD, and MC. The original and proposed GS algorithms are performed according to
the procedures illustrated in Figures 2 and 3, respectively. After this step, the optimal hyperparameters of
the original and proposed GS algorithms are obtained. The performance of both algorithms is evaluated
using the mean squared error (MSE) error loss, defined as [27], [28]:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{12}$$

where $[y_1, y_2, \dots, y_n]$ and $[\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n]$ are the test and prediction values,
respectively.

The testing process: the two optimal XGBoost models each generate the predicted values Ypredict from
Xtest, from which the error metric MSE is calculated. The error metrics of the original and proposed GS
algorithms are then compared with each other.


[Figure 4: flowchart with three stages.
Extracting: the data is processed and split into (Xtrain, Ytrain) and (Xtest, Ytest).
Training: the original grid search and the proposed grid search each run over the search space and return
their optimal hyperparameters.
Testing: each resulting XGBoost model predicts Ypredict from Xtest, and the error metric MSE is computed
against Ytest for both models.]


Figure 4. The method of evaluating the proposed algorithm
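
The testing step reduces to computing the MSE of each model's predictions on the same test targets and comparing the two values. The sketch below shows this with hypothetical numbers; the prediction lists stand in for the outputs of the two tuned XGBoost models and are not data from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between test values and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test targets and predictions of the two tuned models.
y_test = [2.0, 4.0]
predictions = {
    "original": [1.0, 6.0],
    "proposed": [2.0, 5.0],
}

test_errors = {name: mse(y_test, pred) for name, pred in predictions.items()}
better = min(test_errors, key=test_errors.get)   # model with the lower test MSE
```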


4. EXPERIMENTS RESULTS 
4.1. Parameter settings and dataset description 

This study employed the GridSearchCV tool provided by scikit-learn [29] to implement the GS algorithm
with the cross-validation technique. The experiments were conducted on Google Colab’s TPU [30]. Table 2
lists the hyperparameter ranges considered in this study. The LR grid has 30 values and the MD and MC
grids have 20 values each, so the total number of combinations is 30×20×20=12,000. In addition, both the
original and proposed GS algorithms are performed using k-fold cross-validation with 2 folds.
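
The size of the search space can be checked with a short sketch. It assumes the LR grid runs from 0.01 in steps of 0.01 for 30 values, and MD and MC from 1 to 20 in steps of 1; the dictionary keys are the XGBoost parameter names from Table 1, arranged in the name-to-list-of-values shape that a grid-search `param_grid` typically takes.

```python
from itertools import product

# Assumed grids: 30 LR values starting at 0.01, 20 values each for MD and MC.
param_grid = {
    "learning_rate": [round(0.01 * i, 2) for i in range(1, 31)],
    "max_depth": list(range(1, 21)),
    "min_child_weight": list(range(1, 21)),
}

# Every combination of the three hyperparameters is one grid point.
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

# With 2-fold cross-validation each grid point is fitted twice.
n_folds = 2
n_fits = n_combinations * n_folds
```

Under these assumptions the exhaustive search needs 24,000 model fits per dataset.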

The electric load data of both TA (Australia) and HCM (Vietnam) are used in the paper to verify the
effectiveness of the proposed algorithm. The Tasmania dataset records the electric load every half hour,
producing 48 points per day, while the Ho Chi Minh City dataset records the electric load every hour,
producing 24 points per day. For each dataset, 63 days are considered: 56 days (two typical months) for the
training dataset and 7 days (one week) for the testing dataset. The dataset description of TA and HCM is
shown in Table 3.


Table 2. The ranges and options for hyperparameter

                                        Range
Abbreviation  Hyperparameter      min    max    step
LR            learning_rate       0.01   0.30   0.01
MD            max_depth           1      20     1
MC            min_child_weight    1      20     1


Table 3. The data description

Data  Training                                   Testing
TA    - Dimension:                               - Dimension:
        Xtrain: (2688, 48), Ytrain: (2688,)        Xtest: (336, 48), Ytest: (336,)
      - Time: from 3/23/14 to 5/17/14            - Time: from 5/18/14 to 5/24/14
HCM   - Dimension:                               - Dimension:
        Xtrain: (1344, 24), Ytrain: (1344,)        Xtest: (168, 24), Ytest: (168,)
      - Time: from 10/22/18 to 12/16/18          - Time: from 12/17/18 to 12/23/18


4.2. The training process

Figure 5 illustrates, step by step, the procedure of the original GS algorithm described in Figure 2 for
the HCM data. Figure 5(a) shows the boxplot of error loss by LR, with the obtained optimal value
LR=0.11. The boxplot of error loss by MD is presented in Figure 5(b), with the optimal value MD=3.
Finally, Figure 5(c) reports the boxplot of error loss by MC, with the optimal value MC=5. So, the optimal
hyperparameter set of the original GS algorithm is {LR=0.11, MD=3, MC=5} for the HCM data. This is
the output of the training process of the original GS algorithm shown in Figure 4. The corresponding result
for the TA data was obtained in the same way, as shown in Table 4.


Figure 5. The training process of original GS algorithm, HCM (a) hyperparameter LR, (b) hyperparameter
MD, and (c) hyperparameter MC


Table 4. The training process of the original grid search 


Data Optimal hyperparameters Error loss (MSE) 
HCM LR=0.11; MD=3; MC=5 8284.3 
TA LR=0.27; MC=1; MD=4 805.2 


Figure 6 reports, step by step, the procedure of the proposed GS algorithm (model 1) for the HCM data,
as described in Figure 3(a). The boxplots of error loss by LR and by MD, with the optimal values
LR=0.03 and MD=5, are illustrated in Figures 6(a) and 6(b), respectively.
Figure 6(c) shows the boxplot of error loss by MC with the optimal value MC=12. As a result, the optimal
hyperparameter set of model 1 based on the proposed GS algorithm is {LR=0.03, MD=5, MC=12}.
In the same way, the optimal hyperparameters of models 2 to 6 were obtained as shown in Table 5.
Comparing these models and choosing the one with the smallest error loss gives the optimal
hyperparameter set {MD=4; MC=5; LR=0.11} for the proposed GS algorithm on the HCM data. For the
TA data, the optimal hyperparameter set is {MD=1; MC=6; LR=0.12} (Table 6). Note that during training,
the error loss of the original GS algorithm (8,284.3 MW for the HCM data and 805.2 MW for the TA data)
is slightly smaller, by less than 0.5%, than that of the proposed GS algorithm (8,322.14 MW and
806.26 MW for the HCM and TA data, respectively). This is expected, since the original algorithm selects
the minimum training error loss directly.


Figure 6. The training process of proposed GS algorithm, model 1, HCM (a) hyperparameter LR,
(b) hyperparameter MD, and (c) hyperparameter MC


Table 5. The proposed grid search, training process, HCM 


Model Optimal hyperparameters Error loss (MSE) 
1 LR=0.03; MD=5; MC=12 8,989.92 
2 LR=0.03; MC=17; MD=12 9,055.84 
3 MD=4; MC=5; LR=0.11 8,322.14 
4 MD=4; LR=0.11; MC=5 8,322.14 
5 MC=17; MD=3; LR=0.16 9,168.17
6 MC=17; LR=0.03; MD=12 9,055.84 


The smallest error loss: MD=4; MC=5; LR=0.11 


Table 6. The proposed grid search, training process, TA 


Model Optimal Hyperparameters Error loss (MSE) 
1 LR=0.02; MD=20; MC=6 814.53 
2 LR=0.02; MC=6; MD=16 810.35 
3 MD=1; MC=6; LR=0.12 806.26 
4 MD=1; LR=0.12; MC=6 806.26 
5 MC=7; MD=1; LR=0.13 819.34 
6 MC=7; LR=0.02; MD=19 830.42 




The smallest error loss: MD=1; MC=6; LR=0.12 


4.3. The testing process 

Based on the training process given in subsection 4.2, the optimal hyperparameters of the original
and proposed GS algorithms are listed in the ‘optimal hyperparameters’ columns of Table 7. The last
column of Table 7 gives the error metric MSE obtained in the testing process with these optimal
hyperparameters for the HCM and TA data. The MSE of the proposed algorithm is 2,282 MW for HCM
and 501 MW for TA, smaller than that of the original algorithm at 2,424 MW and 537 MW, respectively,
which is a reduction of about 5.9% and 6.7%. These results indicate an advantage of the proposed
algorithm over the original one in terms of the testing error metric.


Table 7. The results of testing process

                                Optimal hyperparameters        Error metric MSE
Data  Method                    (in the training process)      (in the testing process)
                                LR      MD      MC
HCM   Original grid search      0.11    3       5              2,424
      Proposed grid search      0.11    4       5              2,282
TA    Original grid search      0.27    1       4              537
      Proposed grid search      0.12    1       6              501
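
The relative reduction in test MSE follows directly from the values in Table 7, as the short sketch below shows. Only the four MSE values come from the paper; the helper name and the rounding to one decimal place are choices made here.

```python
# Test MSE values taken from Table 7.
test_mse = {
    "HCM": {"original": 2424, "proposed": 2282},
    "TA": {"original": 537, "proposed": 501},
}


def relative_reduction(original, proposed):
    """Percentage by which the proposed MSE is below the original MSE."""
    return 100.0 * (original - proposed) / original


reductions = {
    name: round(relative_reduction(**values), 1)
    for name, values in test_mse.items()
}
```

The proposed algorithm lowers the test MSE by roughly 5.9% on the HCM data and 6.7% on the TA data.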


5. CONCLUSION

This paper proposes a new GS algorithm for obtaining the optimal hyperparameters of the XGBoost
model. The proposed algorithm selects hyperparameters based on the minimum median values of the error
loss, instead of the minimum values used by the original algorithm. The boxplot distribution is used to
conduct both the proposed and original GS algorithms. A benchmark method was set up to evaluate the
performance of the two algorithms using the daily electric load demand of HCM, Vietnam, and the state of
TA, Australia. According to the experimental results, the proposed algorithm achieved a lower testing MSE
than the original one on both datasets, which supports the effectiveness of the proposed GS algorithm.